popDMS infers mutation effects from deep mutational scanning data

Abstract Summary Deep mutational scanning (DMS) experiments provide a powerful method to measure the functional effects of genetic mutations at massive scales. However, the data generated from these experiments can be difficult to analyze, with significant variation between experimental replicates. To overcome this challenge, we developed popDMS, a computational method based on population genetics theory, to infer the functional effects of mutations from DMS data. Through extensive tests, we found that the functional effects of single mutations and epistasis inferred by popDMS are highly consistent across replicates, comparing favorably with existing methods. Our approach is flexible and can be widely applied to DMS data that includes multiple time points, multiple replicates, and different experimental conditions. Availability and implementation popDMS is implemented in Python and Julia, and is freely available on GitHub at https://github.com/bartonlab/popDMS.


Evolutionary model
We model rounds of selection in deep mutational scanning experiments like rounds of reproduction in an evolving population.For this purpose, we use the Wright-Fisher (WF) model 1 , a simple model from population genetics where individuals in a population undergo discrete rounds of mutation, selection, and reproduction.We define the WF model as follows.We assume that the population consists of N individuals, each of which possesses a genetic sequence of length L. Each site in the genetic sequence can take on one of q possible states, resulting in M = q L possible genotypes.
In the context of DMS experiments, we are typically interested in the properties of proteins with different amino acid variants at each site, and thus we use q = 21 for data analyses (representing 20 amino acids and a stop, which could also be further extended to account for gaps).However, the framework that we consider is more general.One could consider nucleotide sequences with q = 4 states (A, C, T, G), q = 64 codons, and so forth.
At each time t, the state of the population is defined by a genotype frequency vector z(t) = (z 1 (t), z 2 (t), . . ., z M (t)), where z a (t) = n a (t)/N , with n a (t) representing the number of individuals that have genotype a at time t.Under the WF model, the probability of observing genotype frequencies z(t + 1) in the next generation is binomial, (N z a (t + 1))! , (1)   with p a (z(t)) = z a (t)f a + b̸ =a (µ ba z b (t) − µ ab z a (t)) Here f a is the fitness of genotype a, defined in detail below, and µ ab is the probability of mutation from genotype a to genotype b in one generation.In typical experiments, mutation rates are low enough that we assume µ ab is zero across all pairs of genotypes a, b.
We assume that the fitness of each genotype depends linearly on the amino acid (or nucleotide, codon, etc.) content of the sequence, ( In Eq. (3), the s i are selection coefficients for each variant i, which quantify the effect of that variant on fitness.If s i is positive, then the variant is beneficial, and if s i is negative, then the variant is deleterious.Here g a i is an indicator variable, which is equal to one if genotype a possesses the variant i, and zero otherwise.The variant indicator i is a generic index that runs across all possible amino acids or states at each site in the sequence.For example, let us define a genotype sequence a = (T, E, K).For this sequence, , and all other g a i = 0.
Following Eq. ( 1), the probability of a sequence of K genotype frequency (z(t k )) K k=1 = (z(t 1 ), z(t 2 ), . . ., z(t K )), conditioned on an initial distribution of genotype frequencies z(t 0 ), is given by the product of the individual transition probabilities,

Inferring fitness effects of mutations with popDMS
We view sequencing results in a DMS experiment as measurements of the genotype frequency vectors z(t).To infer the functional effects of mutations, we apply Bayes' theorem, seeking the selection coefficients s that maximize the posterior probability of the entire evolutionary trajectory Eq. ( 4).This includes a Gaussian prior distribution for the selection coefficients Here γ encodes of the width of the prior distribution, which can also be thought of as L 2 -norm regularization of the selection coefficients.Since we optimize γ based on the data, our inference framework is not Bayesian in the strict sense, but it is effectively maximum likelihood inference with ridge regression.This prior will act to shrink the estimates of all mutation effects (including the wild-type) toward zero, so that large selection coefficients are not inferred without strong evidence.
The overall posterior distribution for the selection coefficients is then given by where the likelihood of the data L is given by Eq. ( 4).Following recent computational advances 2 , to simplify the likelihood, we consider the diffusion limit of the WF model.In this limit, we assume N is large and the s and µ ab are small, i.e., formally of order O(1/N ).Then we derive the Fokker-Plank (FP) equation 1,3,4 from the WF process Eq. ( 4), which describes the evolution of the probability density of genotype frequencies, (7) Here (d a (z)) a is referred to as the drift and (C ab (z)) ab is the diffusion.Note that there are some differences in terminology between population genetics and FP equations.The "drift" term in the Fokker-Planck equation does not describe genetic drift in population genetics; instead, genetic drift is captured by the "diffusion" term.
The drift and diffusion terms arise from the first and second order cumulants of the binomial process Eq. ( 1), respectively: Drift is characterized by selection pressure and mutation effects.By taking the leading orders in O(1/N ), then drift and diffusion terms can be expressed as: Here, µ fl a is a net flux consisting of incoming probability flux from all genotypes b to a and outgoing flux from genotype a to all other genotypes b (due to mutation).We can then convert the FP equation Eq. ( 7) with drift and diffusion Eq. ( 9) into an expression that describes the probabilities of genotype frequency trajectories, Here, we denote ∆z(t k ) as frequency change z(t k+1 ) − z(t k ).The last expression enables us to obtain an analytical solution for the optimal selection that maximizes the posterior distribution over the evolution Eq. ( 6).Finally, the selection coefficients that maximize Eq. ( 6) are given by ŝi where ∆t k = t k+1 − t k , ∆x j = x j (t K ) − x j (t 0 ), and µ fl j is the net expected change in the frequency of variant j over the course of the experiment due to mutation alone.Typically, µ fl is assumed to be zero, except for experiments involving viral replication, where mutation rates can be high enough to produce observable changes in frequency.Here C(t) is the covariance matrix of variant frequencies x i (t) = M a=1 g a i z a (t), which has entries Here x ij (t) = M a=1 g a i g a j z a (t) is the frequency of genotypes at time t that contain both variants i and j.
The estimate of the selection coefficients ŝi given in Eq. ( 11) can be explained intuitively.First, for simplicity, consider the matrices C(t k ) to be diagonal.Then, the estimate for ŝi depends on how much variant i has increased in frequency over the course of the experiment, after correcting for changes in frequency that are not due to functional selection, (∆x i − µ fl i ).This quantity is normalized by the variance of the variant frequency x i (t k ) over time (Eq.( 12)).In the limit that x i (t k ) is small (and again, that the off-diagonal terms are zero), the estimate for ŝi is similar to an enrichment ratio, because in this limit 1 − x i (t k ) ≈ 1.However, this estimate is also shrunk by a factor of γ due to the prior distribution for the selection coefficients.Importantly, the variance also becomes small when x i (t k ) is close to one, as is often the case for wild-type (WT) or reference amino acids in DMS experiments.This distinguishes the treatment of WT variants in popDMS as compared to ratio-based methods and regression-based methods that do not assume logistic growth.
Off-diagonal terms in Eq. ( 11) account for the influence of genetic background on changes in variant frequency.For example, a variant i may increase in frequency not because it has a beneficial functional effect, but rather because it appears on the same genetic sequence with other beneficial variants more often than expected by chance (i.e., positively covarying with other beneficial variants; see Eq. ( 12)).In population genetics, this phenomenon is referred to as genetic hitchhiking 5 .In DMS data, covariances cannot always be computed due to limited read lengths, but this information can be used to enhance predictions when it is available.
To derive Eq. ( 11), we assumed that the number of individuals in the population, N , is constant.However, in experiments (and in real populations), N can vary in time.Incorporating time-varying population sizes leads to similar estimates of selection, but with a larger uncertainty in the inferred parameters (see ref. 6 for a related model).For simplicity, we will maintain the assumption that N is constant.Additionally, in the discussion below we will absorb the population size N into the definition of γ, so that the strength of the regularization does not rely on an arbitrary definition of population size.

Joint estimates of selection coefficients across experimental replicates
We model experimental replicates as alternative evolutionary histories, subject to the same functional effects of mutations but with different stochastic realizations of evolution (and potentially different starting conditions).The posterior probability for the selection coefficients across R replicates is then given by Here each experimental replicate has a different index r, and the likelihood across all replicates is the product of the likelihood for each individual replicate.Since each L is Gaussian in the selection coefficients, the product is also Gaussian, and the MAP selection coefficients can be computed as in Eq. ( 11), yielding

Correction for sequencing errors
For some data sets, information on sequencing error rates is available.For example, this can be obtained by sequencing a library consisting of all WT sequences, so that all differences from WT are likely attributable to sequencing errors.When this data is available, we compute corrected mutant and WT counts by subtracting the expected contributions from sequencing errors.

Optimizing the regularization strength
For simplicity, we incorporate the WF population size N into the prior parameter γ to define an effective regularization strength γ ′ = γ/N .Larger values of γ ′ put a higher penalty on inferred selection coefficients, thereby suppressing their values, but also limiting the effects of sampling noise.Smaller values of γ ′ allow for the inference of larger selection coefficients, but in turn, these estimates are more sensitive to noise.One can choose a single value of γ ′ to use for all data sets, but this parameter can also easily be optimized for an individual data set.The most computationally intensive step in inferring mutation effects (i.e., selection coefficients) with popDMS is computing the variant frequencies and covariances from sequencing data.After this step has been completed, it is straightforward to sweep through a range of γ ′ values and test their results for each data set.
We found that the average correlation of inferred mutation effects between replicates typically behaves like a logistic function of log (γ ′ ).For very small values of γ ′ , sampling noise is not effectively suppressed, and the correlation of inferred mutation effects between replicates is lower.As γ ′ increases, noise is suppressed, leading to higher correlations between replicates.At high values of γ ′ , high correlations between replicates are typically preserved, but the inferred selection coefficients are shrunk strongly towards zero.
We reasoned that an optimal choice for the regularization strength γ ′ would be the smallest value of γ ′ that effectively suppresses sampling noise, as this would avoid shrinking estimated selection coefficients unnecessarily.To compute this value, for each experimental data set described below, we swept through values of γ ′ in even logarithmically spaced steps from roughly 1/B, where B is the maximum read depth, to 1000.For each value of γ ′ , we computed the correlation between replicates.We then computed the difference ∆R = R max − R min between the maximum correlation and minimum correlation between replicates across all values of γ ′ .To determine the optimum value of γ ′ , we started with the value that corresponds to the maximum correlation between replicates.We then identified the γ ′ where R values drop the most significantly.If R does not decrease more than a threshold, we used the smallest γ value where R decreases by 10% of R max .
While sweeping through values of γ ′ improves our consistency across data sets, allowing us to adjust our regularization to match the level of noise in the data, we emphasize that this step is not essential to obtain robust results.A simple choice of γ ′ = 0.1 is nearly optimal for every data set we considered, with the exception of the influenza PR8 study 7 .This data set is the only one in which the correlation between replicates is not roughly a logistic function of the regularization strength.

Generating logo plots with popDMS
Inferences from DMS data such as amino acid preferences (derived from enrichment ratios) have often been used to generate logo plots that show the relative dominance of different amino acids at each site.However, while preferences naturally sum to one, selection coefficients inferred by popDMS can be both positive and negative.To obtain preference-like logo plots using selection coefficients inferred by popDMS, computed exponentially transformed values where the scaling factor β was approximately chosen to maximize the correlation between the transformed selection coefficients p i and amino acid preferences for the same data set.

Inference of epistasis
We extended our approach to infer pairwise epistatic interactions between variants by adding epistatic interactions s ij to the previous fitness function Eq. ( 3), i.e., As for the selection coefficients defined above, if an epistatic interaction s ij is positive, then the presence of variants i and j together increases fitness more than would be expected from the combined effect of the individual variants.When s ij is negative, variants i and j together are more deleterious than expected if they were independent.With this extension of the fitness model, one can then compute the posterior probability for the change in genotype frequencies, as in Eq. (6).We also assume a Gaussian prior distribution for the epistatic interactions that is centered at zero and has the same width as for the selection coefficients.The MAP selection coefficients for the selection coefficients and epistatic interactions have a form analogous to Eq. ( 11), but with an expanded index that runs over all variants i and all pairs of variants (i, j).Additional terms in the covariance matrix are then given by popDMS differs from some alternatives to estimating epistasis in that information about pairwise interactions is gained from all sequences that bear two or more non-reference variants.For example, one previously developed approach effectively estimated the fitness of sequences with exactly two mutations and compared this with estimates of the fitness for corresponding single mutants to estimate the strength of epistatic interaction between the mutations 8 .
At present, inferring epistatic interactions from DMS data with popDMS is only computationally feasible for short sequences due to the large size of the covariance matrix.Alternative approaches that strictly enforce sparsity and reduce the number of possible interactions to estimate could potentially ease these computational restrictions.

Testing performance in simulations
We simulated evolution following the WF model over a number of generations to test the performance of popDMS.To reproduce finite sampling statistics similar to those observed in experimental data, we used the initial genotype frequency data from an experimental data set 9 .We ordered the variants by frequency at each site and inferred a best-fit multinomial model describing the frequency distribution across sites using PyStan 10 .This inferred distribution thus captures a typical hierarchy of frequencies observed in DMS experiments, from high frequency (WT/reference) variants to rare ones, whose counts may be of the same order as the read depth.
In our simulations, selection coefficients for all variants were chosen at random from a normal distribution with mean zero and standard deviation 0.1.True starting frequencies were sampled at random from the inferred multinomial distribution using PyStan.We then simulated up to 10 generations of evolution following the WF model, here assuming a mutation rate of zero and population size of N = 10 8 .From these true trajectories, we obtained finitely sampled frequency trajectories by multinomial sampling from the true frequencies at each generation, with various choices for the sampling depth.To highlight stochasticity, we used a sampling depth of B = 5 × 10 4 sequences in Supplementary Fig. 1a.
We used this data to compute the average correlation for selection coefficients inferred from different replicates using popDMS, which varies depending on the number of generations of data used (Supplementary Fig. 1b).Intuitively, observing the evolution for a longer time leads to more precise estimates.
We compared the results of popDMS against other common approaches, which we implemented as described below.To compute enrichment ratios, we compare the fraction of reads with a particular variant a pre-and post-selection, Here n a pre and n a post are number of reads with variant a before selection and after selection, respectively.Similarly, B pre and B post represent the total number of reads before and after selection.To compute log ratio scores, we used the natural logarithm of the enrichment ratios, Finally, log ratio regression scores were computed by calculating the logarithm of the enrichment ratio Eq. (20) for each variant at each generation, then extracting the slope of the linear model the best fits the change in log enrichment ratios over time.

Effects of strong noise on read counts
We further extended the simulations described above to model the effects of strong noise in read counts, which can appear in DMS experiments 11,12 .We modeled additional sampling noise for read counts using a negative binomial distribution P NB (λ, r), a heavy-tailed distribution that has been used in prior work to model overdispersion in sequence count distributions.Here, λ and r are the mean (expected) read counts and the dispersion parameter, respectively.We used dispersion parameters derived from the analysis of experimental data: λ a = Bn a /N and r = r(λ) = βλ α with α = 0.69 and β = 0.8 (ref. 13).In the expression for λ a , B is the sample size (i.e., the total number of reads) and N is the total population size.Subsequently, we obtained an ensemble of read counts sampled from the WF model simulations with the fitness function described above.
We then compared the inferred selection coefficients with the true ones (Supplementary Fig. 2), using the same procedure as for experimental data sets.We found that the correlation obtained by popDMS (Pearson's R = 0.94) exceeded ones using log-preferences, log-ratio regression, or log-enrichment ratios (Pearson's R = 0.65, R = 0.53, and R = 0.53, respectively; Supplementary Fig. 2).For the regularization strength, we used the same procedures as for experimental data.We used three replicates for all inference methods.

Comparison with prior studies of epistasis
Here we analyzed a data set from Araya and collaborators, which explored epistasis in the WW domain of the hYAP65 protein 8 .There, they define epistasis in a way that differs from our definition (i.e., the s ij in Eq. ( 16)).For each genotype variant a, Araya et al. define a parameter W a = 2 (S a −S WT ) , where the S a are best-fit slopes of the logarithmic enrichment ratios for variant a. S WT is the slope for the WT variant, which they use to normalize the results.They use the quantity ϵ ab = W ab − W a W b as the primary metric of epistasis.Here, a and b represent genotypes with a single mutation, and ab the genotype that features only these two mutations.
When the frequency of a variant is small, the S a computed by Araya et al. are similar to our f a .Thus, to compare the quantities inferred by Araya et al. to our s ij , we computed a set of transformed scores, which we write as There is good overall agreement in the epistatic interactions s ij inferred by popDMS and the transformed interactions sij , computed from the W values of Araya et al. (Pearson's R = 0.73, Spearman's ρ = 0.75).Figure 2a similarly shows broad agreement between the sum of squared epistatic interactions between variants at each pair of sites in the WW domain, though those inferred by popDMS are sparser (Figure 2b).

Comparison with natural frequencies of influenza variants
In general, it is challenging to validate inferences about the fitness or functional effects of amino acid variants inferred from DMS experiments because "ground truth" measurements for these effects do not exist.However, one possible method of validation is to compare the inferred fitness effects of variants to the frequency of mutations observed in natural populations.This approach was explored by Thyagarajan and collaborators in their study of the effects of mutations in the influenza hemagglutinin protein 21 .
We performed a similar analysis to compare our results to fitness effects inferred using enrichment ratios for the same data set 21 .While it is possible to directly correlate variant frequency and the inferred fitness effect of the variant, this connection is not entirely natural because frequency should be determined not just by the fitness effect of a variant, but also by the relative fitness effects of other possible variants at the same site.
To make a clearer connection with the data, we reasoned that, in most cases, the amino acid with the highest frequency in natural populations should be the variant with the highest fitness at each site.We thus ranked the fitness effects of each amino acid variant at the same site, and computed the rank of the top variant according to both selection coefficients inferred by popDMS and enrichment ratios.For popDMS, the amino acid most frequently observed in natural populations had an average rank of 2.1 across sites (median 1), compared to an average rank of 2.7 (median 1) for enrichment ratios.
To determine the extent to which the amino acid that is most frequently observed in natural populations is predicted to be dominant at each site, we also computed a z score for the most frequent variant at each site.This was computed by taking the metric of fitness (selection coefficients or enrichment ratios) for the most frequent variant at each site, subtracting the mean value for the same site, and dividing by the standard deviation of values at that site.We found an average z score for the most frequent variant of 3.5 using popDMS, compared to 2.6 for enrichment ratios.
Thus, we find that selection coefficients match well with the corresponding frequencies of amino acid variants in a natural population.Results obtained using popDMS also compare favorably with prior results computed using enrichment ratios 21 .

Data and code
Raw data and code used in our analysis are available in the GitHub repository located at https://github.com/bartonlab/paper-DMS-inference.This repository also contains Jupyter notebooks that can be run to reproduce the results presented here.Code for popDMS alone, without the analysis contained in this paper, is also provided in a separate GitHub repository at https://github.com/bartonlab/popDMS.popDMS is coded in Python3 and C++.

. 1 .Supplementary Fig. 2 .
popDMS is robust to finite sampling error.a, Due to finite sampling of the data, variant frequencies can appear to fluctuate over time even if the underlying behavior is smooth, complicating inference.Results from an example simulation (Methods).b, As the number of generations used for inference in simulations increases, all methods become more robust.popDMS is especially robust in inferring mutation effects from limited data with few rounds of selection.Comparison between true and inferred fitness effects of mutations in simulations including overdispersion of variant counts.a, Selection coefficients inferred via popDMS from simulations compared with the underlying true ones (see Methods).Correlations between true and inferred coefficients are higher for popDMS than for alternatives, including (b) log preferences, (c) log ratio regression, and (d) log enrichment ratios.